Josh Quan
UC Berkeley Library
Fall 2017
An approximate answer to the right question is worth a great deal more than a precise answer to the wrong question.
-John Tukey
How feasible or doable is your research question?
How many observations do you need?
Does the answer to your question have too many angles? If so, your question might be too broad to answer in the time available.
| Unit of Analysis | Geography | Time-Period | Frequency |
|---|---|---|---|
| For which level do you want data? Summary or Micro? (individuals, counties, nations) | Is there a geographic component to your topic? (U.S., Sub-Saharan Africa, India) | Do you want data for a specific time period? (1980-2000, 1930-1960) | How often do you want measures for your variables? (every year, every ten years, monthly, quarterly) |

| Researchers | Government Agencies | NGOs | Research Organizations |
|---|---|---|---|
| Are there people you know who are doing this kind of research? | Think about government agencies - is the request for some official statistics or data that they’d be likely to collect and publish? (industry, agriculture, construction, disease, crime) | Are there councils or interest organizations devoted to the topic that might collect data independently? (HIV/AIDS, drugs, civil rights) | Would any specific research organizations be interested in the topic? (Pew, Roper, Gallup, NORC, NBER, World Bank, OECD) |
It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data. -Dasu and Johnson, 2003
“Happy families are all alike; every unhappy family is unhappy in its own way.” –– Leo Tolstoy
“Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham
Tidy data has the following attributes:
Each variable forms a column that holds its values
Each observation forms a row
Each type of observational unit forms a table
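
As a rough illustration of these attributes, the sketch below reshapes a small wide table (one column per quarter) into a tidy one with one row per observation. The values are the 2014 and 2015 rows of the price-index table shown at the end of this document; the column names `year`, `quarter`, and `index` are assumptions chosen for the example, not names from the source data.

```r
# Wide layout: the variable "quarter" is spread across four columns.
wide <- data.frame(
  year = c(2014, 2015),
  I    = c(107.3, 108.6),
  II   = c(107.2, 108.8),
  III  = c(107.4, 109.3),
  IV   = c(107.6, 109.5)
)

quarters <- c("I", "II", "III", "IV")

# Tidy layout: each variable (year, quarter, index) is a column,
# and each year-quarter observation is a row.
tidy <- data.frame(
  year    = rep(wide$year, times = length(quarters)),
  quarter = rep(quarters, each = nrow(wide)),
  index   = unlist(wide[quarters], use.names = FALSE)
)

tidy
```

This uses base R only. If the tidyr package is available, `tidyr::pivot_longer()` with `names_to` and `values_to` would be a more compact way to do the same reshaping.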
| Good Example | Bad Example | Description |
|---|---|---|
| gnp2010 | gnp-2002; gnp#2002 | Avoid special characters |
| real_int | real interest rate | Use underscores, not spaces |
| score1; gnp2003 | 1st_score; 2003gnp | Begin with a letter |
| reg_out; glm1 | REG; glm; ttest | Avoid reserved words |
| invest; interest | xxx; yyy; zmdje | Use meaningful names |
| male; asian | gender; race | Name dummy variables after the category coded 1 |
| citizen | Are_you_a_US_citizen? | The shorter, the better |
| income; intUS03 | INCOME; Int_us2003 | Use lowercase |
| 2017-04-20 | April 20, 2017 | Use the ISO 8601 date format (YYYY-MM-DD) |
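
Some of the mechanical rules in the table can be applied automatically. The following is a minimal sketch, not a complete solution: `clean_name()` and `to_iso_date()` are hypothetical helpers written for this example, and the `x` prefix added to names that start with a digit is an arbitrary choice. Rules that need judgment (meaningful, short names) still have to be applied by hand.

```r
# Hypothetical helper: apply the mechanical naming rules from the table.
clean_name <- function(x) {
  x <- tolower(x)                    # use lowercase
  x <- gsub("[[:space:]]+", "_", x)  # underscores, not spaces
  x <- gsub("[^a-z0-9_]", "", x)     # drop special characters
  x <- sub("^([0-9])", "x\\1", x)    # begin with a letter (prefix is an assumption)
  x
}

# Hypothetical helper: turn "Month day, year" into ISO 8601 (YYYY-MM-DD).
# Uses the built-in English month.name vector, so it does not depend on locale.
to_iso_date <- function(x) {
  parts <- strsplit(gsub(",", "", x), " ")[[1]]  # e.g. "April" "20" "2017"
  sprintf("%s-%02d-%02d",
          parts[3],
          match(parts[1], month.name),
          as.integer(parts[2]))
}

clean_name(c("gnp-2002", "real interest rate", "INCOME", "2003gnp"))
to_iso_date("April 20, 2017")
```
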
http://statbel.fgov.be/en/statistics/figures/economy/indicators/prix_prod_con/
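library(rvest)  # provides read_html(), html_nodes(), html_text() and the %>% pipe
url <- 'http://statbel.fgov.be/en/statistics/figures/economy/indicators/prix_prod_con/'
page <- read_html(url)                               # download and parse the page once
TAB <- page %>% html_nodes('td') %>% html_text()     # text of the table cells
NAMES <- page %>% html_nodes('th') %>% html_text()   # text of the header cells
# 9 rows (years) by 5 columns (four quarters plus the yearly value), filled row by row
M <- data.frame(matrix(TAB, ncol = 5, nrow = 9, byrow = TRUE))
M <- cbind(NAMES[7:15], M)   # header cells 7 to 15 hold the year labels
names(M) <- NAMES[1:6]       # header cells 1 to 6 hold the column names
M
## Gross indices (2010=100) I II III IV Year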
## 1 2008 99.9 101.2 101.0 102.3 101.1
## 2 2009 101.0 99.7 100.5 98.9 100.0
## 3 2010 99.4 99.8 100.0 100.8 100.0
## 4 2011 102.9 103.2 104.5 105.1 103.9
## 5 2012 105.7 106.1 106.0 105.6 105.9
## 6 2013 105.4 105.4 106.7 107.1 106.1
## 7 2014 107.3 107.2 107.4 107.6 107.4
## 8 2015 108.6 108.8 109.3 109.5 109.1
## 9 2016 110.3 110.7 110,8 111,3 110.8
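
The scraped values are text, and the last row mixes decimal points with decimal commas (`110,8`, `111,3`), so the columns cannot be used as numbers without cleaning. The sketch below shows one possible fix on the 2015 and 2016 rows copied from the output above. The helper `to_number()` and the column name `year` are assumptions made for the example.

```r
# Two rows copied from the scraped output, kept as character strings.
q <- data.frame(
  year = c("2015", "2016"),
  I    = c("108.6", "110.3"),
  II   = c("108.8", "110.7"),
  III  = c("109.3", "110,8"),
  IV   = c("109.5", "111,3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: treat a decimal comma as a decimal point, then convert.
to_number <- function(x) as.numeric(gsub(",", ".", x, fixed = TRUE))

q[-1]  <- lapply(q[-1], to_number)  # convert every column except year
q$year <- as.integer(q$year)

str(q)
```

Without the replacement step, `as.numeric("110,8")` returns `NA` with a warning, which is how this kind of inconsistency usually surfaces.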